contract,transport: reduce log spam during backend outages#180
contract,transport: reduce log spam during backend outages#180full-bars wants to merge 4 commits into
Conversation
During URnetwork outages or degraded backend conditions, two log lines spam the console on every retry cycle, making it difficult to observe other events: [contract]oob err = Timeout [t]auth error <id> = No successful strategy found. This commit addresses both. contract: OOB error backoff (ryanmello07) Adds SignStoredContract/VerifyStoredContract helpers and a NetworkEventTimeChangeHmac cutover (2026-07-01) for a planned HMAC format migration - verifiers accept both legacy and standard forms at all times so the rollout is asymmetric. Adds CreateContractOobErrorBackoff (default 1 min) to ContractManagerSettings. After the first OOB error, CreateContract returns early for the duration of the backoff window. The log line fires at most once per minute instead of once per in-flight request. transport: auth error logged on state transition only runH1 and runH3 both retry the platform connection in a tight loop. Previously every failed attempt emitted an INFO log line. Now the first failure logs at INFO; subsequent retries in the same failure run are demoted to V(1). The flag resets on successful connect so the next disconnection surfaces at INFO again.
|
Testing update — moving to draft Ran a live outage simulation against this PR on a Detroit server running 200 DC proxies. After ~13 hours of warmup the provider reached 877–2,143 active connections with sustained throughput of 1.4 GiB/hr. During that window there was also a real platform OOB condition lasting 10+ hours, which gave extended natural validation of the What the test found: The
Fixes applied in this update:
Next steps: Rebuilding the test binary and restarting the container for a second full warmup + stress test cycle. Moving to draft until that validates. |
markCreateContractOobError now atomically checks and sets the backoff inside a single mutex lock, preventing duplicate log lines when two goroutines race through the check simultaneously. PlatformTransport gains shouldLogAuthErr() which rate-limits [t]auth error logging to once per minute across all transport goroutines on the same instance, preventing log floods when many transports time out simultaneously. Per-transport authErrLogged is preserved for within-session suppression.
Both log lines were rate-limited at the wrong scope. Each proxy creates its own ContractManager and PlatformTransport, so per-instance limiters allow 1000 instances to all log simultaneously during an outage. Switches both to package-level atomic.Int64 CAS rate limiters (max one log line per minute globally across all instances). Suppressed count is reported on each emitted line so operators know how many errors were hidden. Per-instance OOB backoff still gates contract creation. Verified under live stress testing with 1000 proxies + tc netem: - [t]auth error: exactly 1 line/min, suppressed counts up to 3,980 - [contract]oob err: 0-1 lines/min, suppressed count reported correctly
|
"No Successful Strategy Found" errors were not reproducible under netem-based stress testing. Even at 100% packet loss, the provider enters transport auth timeout mode before HTTP-layer strategy selection is reached. This condition may require a different trigger (e.g. routes present but all destination peers unavailable) rather than raw network blackout. It remains a known log spam candidate for a follow-up PR once a reliable reproduction method is established. The timeout spam addressed here ( |
Connection drop errors flood logs during backend outages without suppression, unlike [t]auth error and [contract]oob err. Implement identical package-level atomic rate-limiting pattern: max one message per minute globally, suppression count reported. Validated in production incident 2026-05-30.
Production ValidationThis PR has been validated in production during a recent backend API outage (2026-05-29 approx 3:10 - 3:21PM, ~11 minutes duration). Analysis of ~120 servers across multiple geographic regions (North America: New York, Los Angeles, Houston, Dallas, Denver, Chicago, Kansas City, Atlanta; Europe: London, Berlin, Stockholm) confirmed the existing fixes for 1. [r]drop Error Rate-LimitingToday's outage revealed that connection drop errors ( 2. Bandwidth Leak Prevention (Deferred)Today's incident also validated the anti-bandwidth-leak measures needed to address issues #175 and #181. I implemented and proved these mechanisms in the custom build, but they require foundational changes to the contract and transfer systems that will be more effective as a dedicated PR. This allows proper testing of the degradation detection thresholds and throttling behavior without conflating them with log-spam fixes. Lessons Learned - Faster Data CollectionThis incident also informed improvements to how I collect incident data. I developed an automated baseline collection system that gathers version info, system metrics, bandwidth history, and logs from roughly 120 servers in under 60 seconds. This means future outages will have complete fleet-wide context captured immediately, enabling faster root cause analysis and validation of fixes like those in this PR. :) |
Problem
During platform outages or degraded connectivity, two log lines flood terminal output at high volume, making it impossible to read logs or assess provider health:
```
[t]auth error = Timeout. # one per transport, every ~5s
[contract]oob err = Timeout.; backing off create contract OOB requests for 1m0s # one per ContractManager instance
```
With 1000 proxies running, each proxy spawns its own `PlatformTransport` and its own `ContractManager`. Under a platform outage all 1000 can log simultaneously and continuously.
Root cause
Both error paths had rate limiters scoped to the wrong level:
The fix in both cases: package-level `atomic.Int64` shared across all instances, so the entire process emits at most one log line per minute regardless of how many proxies are running.
Changes
`transport.go`
`transfer_contract_manager.go`
Stress test results
Tested on a live Detroit provider running 1000 proxies using `tc netem` applied via `nsenter` into the container network namespace. Five stages run across two builds.
Stages 1–2 — `netem delay 2000ms loss 50%` (8 minutes each)
Auth fix and OOB fix both verified. Suppressed counts climbed from 52 → 925 → 3,030 → ~3,960 as all 1000 proxies cascaded into timeout simultaneously.
Stages 3–5 — `netem loss 95% / 99% / 100%` (10 minutes each)
Organic validation
Under real peak load (no netem), observed: `[t]auth error ... (101 suppressed)` and `[t]auth error ... (3,952 suppressed)` — single log lines absorbing thousands of simultaneous transport failures.
Expected behavior after fix